\subsection{Qualitative Analysis}
In addition to the high-level quantitative analysis, we also explored the results
at a lower level, trying to identify trends or phenomena that could give us some
insight into what makes responses look more computer- or human-like.

Unfortunately, we could not draw any clear, concrete conclusions. The responses
vary without a clear distinction between those that were clearly ``correct''
(easily identified as human- or computer-generated) and those that were clearly
``incorrect'' (not easily discernible as human- or computer-generated).

We analyzed the responses from several angles, trying to identify distinctive
features that could teach us what makes a response human-like. First, hand-surveying
the extreme-scored responses (Certainly Human/Computer), we could not see any clear
patterns. Next, we manually tried to identify words or phrases that have a close
association (correlation) with low or high scores, again obtaining inconclusive
results except for a few ``extreme'' cases (e.g., responses containing the phrase
``totally not precise'' have an average score of 6.5 while those without it have an
average score of 4.5; this is still weak evidence, since the phrase appears only
twice). An example of this is presented in Table~\ref{tab:similar_extreme}, which
shows two responses, both generated with the KB and with the same sentiment, that
received very different scores.
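
To make the phrase check concrete, the following is a minimal sketch of the
association test we performed, assuming the responses are available as
\texttt{(text, score)} pairs; the data layout and function name are illustrative,
not our actual pipeline:

\begin{verbatim}
# Hypothetical layout: `responses` is a list of (text, score) pairs,
# where score is the mean human-likeness rating for that response.
def phrase_score_gap(responses, phrase):
    """Average score with vs. without `phrase`, plus occurrence count."""
    with_p    = [s for t, s in responses if phrase in t.lower()]
    without_p = [s for t, s in responses if phrase not in t.lower()]
    avg = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return avg(with_p), avg(without_p), len(with_p)

# e.g. phrase_score_gap(responses, "totally not precise")
# -> (6.5, 4.5, 2) for the case discussed above.
\end{verbatim}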

\begin{table}
\centering
\begin{tabular}{|p{5cm}|p{2cm}|}
\hline
Response & Rating \\
\hline
Galaxy is decent. The writer is somewhat right. Nexus on the other hand is
awesome... & Certainly Human \\
\hline
Galaxy is okay. The author is right. In general Samsung makes good
products. & Certainly Computer \\
\hline
\end{tabular}
\caption{Two responses generated with the KB and the same sentiment that received extreme, opposite ratings}
\label{tab:similar_extreme}
\end{table}

In addition, we tried a simple bi-gram and tri-gram analysis. We counted all
n-gram occurrences and calculated the probability of each n-gram appearing with a
specific score, trying to find n-grams that predominantly appear with particular
scores; this also did not reveal any clear patterns.
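
The following is a rough sketch of this n-gram/score profiling, under the same
assumed \texttt{(text, score)} representation as above (whitespace tokenization
and the function names are simplifying assumptions):

\begin{verbatim}
from collections import Counter, defaultdict

def ngrams(tokens, n):
    # Sliding window of n consecutive tokens.
    return zip(*(tokens[i:] for i in range(n)))

def ngram_score_profile(responses, n=2):
    """Estimate P(score | n-gram) from co-occurrence counts."""
    counts = defaultdict(Counter)  # n-gram -> {score: count}
    for text, score in responses:
        for g in ngrams(text.lower().split(), n):
            counts[g][score] += 1
    return {g: {s: c / sum(sc.values()) for s, c in sc.items()}
            for g, sc in counts.items()}
\end{verbatim}

An n-gram whose probability mass concentrates on very low or very high scores
would be the kind of predictive pattern we were looking for; we found none.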

Looking at real (human) responses, we observed similarly ambiguous results.
Our aspiration is to learn from the sentences that were easily discernible as
human: if we can generate such responses, we would expect good results.
A few things we did observe:

\begin{itemize}
	\item Very concrete real-world knowledge and actions (e.g., ``I think it beats the Samsung Galaxy \uline{S3}. So glad \uline{I didn't upgrade} yet'') give a good human-like rating.
	\item Very short, nondescriptive responses (e.g., ``iPhone is great'', ``nice info'') get a lower human-likeness rating (around 4.5).
	\item Still, this is not very conclusive. For example, the response ``this phone sucks!'', which is also very short, gets a rating of 2.7.
\end{itemize}

All in all, we did not learn much from the low-level response investigation; future work will need to investigate more deeply and use more robust methods in order to really understand the subtleties of what makes a response human-like.